ENTROPY INTERPRETATION OF GOODNESS OF FIT TESTS


Emanuel  Parzen 
Institute  of  Statistics 
Texas  A&M  University 


ABSTRACT .  This  paper  describes  a  synthesis  of  statistical  reasoning 
called  FUN.STAT  (because  it  is  fun;  functional  (useful);  based  on  functional 
analysis;  estimates  functions;  and  all  graphs  are  of  functions).  FUN.STAT  has 
three  important  components:  quantile  and  density-quantile  signatures  of 
populations,  entropy  and  information  measures,  and  functional  statistical 
inference. 

A FUN.STAT approach to the problem of identifying the probability distribution $F(x)$ of a random variable $X$ from a random sample is outlined.

To identify $F_0$ in the location-scale parameter model $F(x) = F_0((x-\mu)/\sigma)$, we estimate the entropy difference $\Delta = H^0(f) - H(f)$. $H(f)$ is Shannon entropy and $H^0(f) = \log \sigma + H(f_0)$ is the entropy of the assumed model (which may maximize entropy). Estimators $\hat{H}_1$, $\hat{H}_2$, $\hat{H}_3$ of $H(f)$ are defined which are respectively fully parametric, fully non-parametric, and parametric-select. Significance levels for $\hat{\Delta}$ are obtained by Monte Carlo methods. The family of parametric-select estimators of $\Delta$ may provide optimum tests of $F_0$ (such as normal or exponential) and estimators of $F$ when one rejects $F_0$.

KEY  WORDS:  Entropy-based  statistical  inference,  goodness  of  fit  tests, 
test  for  normality,  Shapiro-Wilk  statistic,  quantile,  density-quantile, 
quantile-density,  autoregressive  density  estimator. 


1. INTRODUCTION. Let $X_1, \ldots, X_n$ be a random sample of a continuous random variable $X$ with distribution function $F(x) = \Pr[X \le x]$, $-\infty < x < \infty$, and quantile function $Q(u) = F^{-1}(u)$, $0 < u < 1$. Tests of normality or exponentiality are special cases of a location-scale parameter model, which we denote by the hypothesis

$$H_0: \quad F(x) = F_0\!\left(\frac{x-\mu}{\sigma}\right), \qquad Q(u) = \mu + \sigma\, Q_0(u),$$

where $F_0(x)$ is a specified distribution with quantile function $Q_0(u)$. Table 1 lists $F_0$ and $Q_0$ for various standard distributions.
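
Table 1. STANDARD DISTRIBUTION FUNCTIONS AND QUANTILE FUNCTIONS

Name                          $F_0(x)$                                            $Q_0(u)$
----------------------------  --------------------------------------------------  -----------------------------
Normal                        $\Phi(x) = \int_{-\infty}^{x} \phi(y)\,dy$,         $\Phi^{-1}(u)$
                              $\phi(x) = (2\pi)^{-1/2} \exp(-\tfrac{1}{2}x^2)$
Exponential                   $1 - e^{-x}$                                        $\log (1-u)^{-1}$
Weibull, quantile shape       $1 - e^{-x^{c}}$, $x > 0$                           $\{\log (1-u)^{-1}\}^{\beta}$
  parameter $\beta$
Extreme value of minimum      $1 - e^{-e^{x}}$, $-\infty < x < \infty$            $\log \log (1-u)^{-1}$
Extreme value of maximum      $e^{-e^{-x}}$, $-\infty < x < \infty$               $-\log \log u^{-1}$
Log normal                    $\Phi(\log x)$, $x > 0$                             $\exp \Phi^{-1}(u)$
Logistic                      $1 - (1 + e^{x})^{-1}$                              $\log \{u/(1-u)\}$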
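
As a quick consistency check on Table 1, the following sketch codes several of the pairs $(F_0, Q_0)$ and verifies numerically that $F_0(Q_0(u)) = u$. It is only an illustration: the names `STANDARD`, `location_scale_quantile`, and `max_roundtrip_error` are hypothetical, the Weibull and log normal rows are omitted for brevity, and the logistic quantile $\log\{u/(1-u)\}$ and the extreme-value-of-maximum distribution function $\exp(-e^{-x})$ are obtained here by inverting the other column of the table.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, Python standard library

# name -> (F0, Q0); a partial transcription of Table 1 (assumed reading of the table)
STANDARD = {
    "normal": (_STD_NORMAL.cdf, _STD_NORMAL.inv_cdf),
    "exponential": (lambda x: 1.0 - np.exp(-x),
                    lambda u: np.log(1.0 / (1.0 - u))),
    "extreme value of minimum": (lambda x: 1.0 - np.exp(-np.exp(x)),
                                 lambda u: np.log(np.log(1.0 / (1.0 - u)))),
    "extreme value of maximum": (lambda x: np.exp(-np.exp(-x)),
                                 lambda u: -np.log(np.log(1.0 / u))),
    "logistic": (lambda x: 1.0 - 1.0 / (1.0 + np.exp(x)),
                 lambda u: np.log(u / (1.0 - u))),
}


def location_scale_quantile(q0, mu, sigma):
    """Return Q(u) = mu + sigma * Q0(u), the quantile function under H0."""
    return lambda u: mu + sigma * q0(u)


def max_roundtrip_error(name, grid=None):
    """Largest |F0(Q0(u)) - u| over a grid of u values in (0, 1)."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    f0, q0 = STANDARD[name]
    return max(abs(float(f0(float(q0(float(u))))) - float(u)) for u in grid)


if __name__ == "__main__":
    for name in STANDARD:
        print(name, max_roundtrip_error(name))
```

The round-trip error should be at the level of floating-point rounding for every entry; a large value would indicate a transcription error in one of the two columns.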


Many statistics have been introduced by statisticians to test the composite (location and scale parameters unspecified) hypothesis of normality. A superior omnibus test of normality (in terms of power) seems to be provided by a test statistic $W = \hat{\sigma}_2/\hat{\sigma}_1$, where $\hat{\sigma}_1$ and $\hat{\sigma}_2$ are scale estimators defined as follows: $\hat{\sigma}_1$ is the sample standard deviation, while $\hat{\sigma}_2$ is a linear combination of order statistics estimator of $\sigma$. We call $W$ a statistic of Shapiro-Wilk type because it is a variant of a test introduced by Shapiro and Wilk (1965) and Shapiro and Francia (1972).

The question arises of discovering a motivation for the $W$ statistic which explains the source of its power, and of using this insight to extend $W$ to other distributions $F_0$. In this paper we propose that the power of $W$ can be explained by representing it as an "entropy difference" test statistic. We show that the test statistic for normality introduced by Vasicek (1977) is also an entropy difference statistic, as are test statistics introduced in Parzen (1979).

2. INFORMATION DIVERGENCE AND ENTROPY. To compare two distribution functions $F(x)$ and $G(x)$ with probability densities $f(x)$ and $g(x)$, a useful measure is information divergence, defined by

$$I(f;g) = \int_{-\infty}^{\infty} \left\{-\log \frac{g(x)}{f(x)}\right\} f(x)\, dx.$$

It can be decomposed into cross-entropy

$$H(f;g) = \int_{-\infty}^{\infty} \{-\log g(x)\}\, f(x)\, dx$$

and entropy

$$H(f) = H(f;f) = \int_{-\infty}^{\infty} \{-\log f(x)\}\, f(x)\, dx$$

by the important identity

$$0 \le I(f;g) = H(f;g) - H(f).$$
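
The identity $I(f;g) = H(f;g) - H(f)$ can be checked numerically. The sketch below is illustrative only: the choice $f = N(0,1)$, $g = N(1, 2^2)$, the integration grid, and the helper names `normal_pdf` and `integrate` are assumptions made for the example, not part of the paper.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def integrate(y, x):
    """Trapezoidal rule, written out explicitly."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


# Grid wide enough that the tails of f contribute negligibly (an assumption).
x = np.linspace(-12.0, 12.0, 200001)
f = normal_pdf(x, 0.0, 1.0)
g = normal_pdf(x, 1.0, 2.0)

cross_entropy = integrate(-np.log(g) * f, x)    # H(f;g)
entropy = integrate(-np.log(f) * f, x)          # H(f)
divergence = integrate(-np.log(g / f) * f, x)   # I(f;g)

if __name__ == "__main__":
    print(cross_entropy, entropy, divergence, cross_entropy - entropy)
```

For two normal densities the divergence has the closed form $\log(\sigma_g/\sigma_f) + \{\sigma_f^2 + (\mu_f - \mu_g)^2\}/(2\sigma_g^2) - \tfrac{1}{2}$, which gives an independent check on the numerical value.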




To estimate entropy it is useful to express it in terms of the quantile density function $q(u)$ and density-quantile function $fQ(u)$ defined by

$$q(u) = Q'(u), \qquad fQ(u) = f(Q(u)) = \{q(u)\}^{-1}.$$



By making the change of variable $u = F(x)$ one can show that

$$H(f) = \int_0^1 -\log fQ(u)\, du = \int_0^1 \log q(u)\, du.$$



Under the hypothesis $H_0$ that $F(x) = F_0((x-\mu)/\sigma)$, a location-scale model, $q(u) = \sigma q_0(u)$ and

$$H(f) = \log \sigma + H(f_0).$$
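
The quantile-density form of entropy, and the location-scale relation $H(f) = \log \sigma + H(f_0)$, can be illustrated numerically. This is a minimal sketch under stated assumptions: the midpoint rule, the grid size, the value $\sigma = 3$, and the function names are all choices made for the example. The exponential and logistic quantile densities follow from differentiating $Q_0(u) = \log (1-u)^{-1}$ and $Q_0(u) = \log\{u/(1-u)\}$.

```python
import numpy as np


def entropy_from_quantile_density(log_q, n_grid=200_000):
    """Approximate H(f) = integral over (0,1) of log q(u) du by the midpoint rule."""
    u = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean(log_q(u)))


def log_q0_exponential(u):
    # Q0(u) = log (1-u)^{-1}, hence q0(u) = 1 / (1 - u)
    return -np.log(1.0 - u)


def log_q0_logistic(u):
    # Q0(u) = log(u / (1 - u)), hence q0(u) = 1 / (u (1 - u))
    return -np.log(u) - np.log(1.0 - u)


sigma = 3.0  # illustrative scale parameter
h0_exponential = entropy_from_quantile_density(log_q0_exponential)
h0_logistic = entropy_from_quantile_density(log_q0_logistic)
# Under H0, q(u) = sigma * q0(u), so log q(u) = log sigma + log q0(u).
h_scaled = entropy_from_quantile_density(
    lambda u: np.log(sigma) + log_q0_exponential(u)
)

if __name__ == "__main__":
    print(h0_exponential, h0_logistic, h_scaled - h0_exponential, np.log(sigma))
```

The standard exponential and standard logistic distributions have entropies 1 and 2 respectively, so the first two printed values should be close to those constants, and the difference `h_scaled - h0_exponential` should equal $\log \sigma$.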

3. ENTROPY DIFFERENCE TO TEST GOODNESS OF FIT. To test the hypothesis $H_0$ we propose to investigate (and eventually establish how to use optimally) test statistics which are entropy-difference statistics

$$\Delta(f) = H^0(f) - H(f)$$

where $H^0(f)$ is a parametric evaluation of the entropy of $f$, evaluated under the assumption that it obeys $H_0$, defined by

$$H^0(f) = \log \sigma + H(f_0),$$

while $H(f)$ is a non-parametric evaluation of $H(f)$, usually most conveniently obtained by

$$H(f) = \int_0^1 \log q(u)\, du.$$
To estimate $H(f)$ we have three types of estimators which we call

$\hat{H}_1$ fully parametric estimator,
$\hat{H}_2$ fully non-parametric estimator,
$\hat{H}_3$ smooth or parametric select estimator.

Similarly to estimate $H^0(f)$ we have several types of estimators depending on the estimator $\hat{\sigma}_j$ we adopt for $\sigma$; thus

$$\hat{H}^0_j = \log \hat{\sigma}_j + H(f_0).$$

Three important possibilities for $\hat{\sigma}_j$ are:

$\hat{\sigma}_1$ maximum likelihood estimator,
$\hat{\sigma}_2$ optimal linear combination of order statistics estimator,
$\hat{\sigma}_3$ estimator of score deviation $\sigma_3 = \int_0^1 f_0Q_0(u)\, q(u)\, du$.

Under $H_0$ these estimators are all asymptotically efficient estimators of $\sigma$.



$\hat{H}_2$ is a fully non-parametric estimator of $H(f)$ based on the gap or leap (of order $2v$) estimator

$$\ldots \; X_{(j+v)} - \ldots$$

of $q(j/(n+1))$, and

$$\ldots, \qquad j = v+1, \ldots, n-v.$$


Some significance levels of $\hat{\Delta}_{12}$ are given in Table 2; they are transformations of the significance levels given by Vasicek (1977) and obtained by Monte Carlo simulation.
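
To make the gap-based entropy difference concrete, here is a hedged sketch of a Vasicek-type statistic for testing normality. It assumes the normalization $n/(2v)$ and the end-point clamping of order statistics used in Vasicek (1977), which may differ in detail from the normalization intended in this paper; it also assumes the maximum likelihood scale estimate for $\hat{\sigma}_1$. The names `vasicek_entropy`, `entropy_difference_12`, and `monte_carlo_level` are hypothetical.

```python
import math
import numpy as np

# H(f0) for the standard normal density: (1/2) log(2 pi e)
H_NORMAL = 0.5 * math.log(2.0 * math.pi * math.e)


def vasicek_entropy(sample, v):
    """Gap estimate of H(f) from spacings of order 2v (Vasicek-style, assumed form)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    if not 1 <= v < n / 2:
        raise ValueError("v must satisfy 1 <= v < n/2")
    idx = np.arange(n)
    upper = x[np.minimum(idx + v, n - 1)]  # X_(j+v), clamped at X_(n)
    lower = x[np.maximum(idx - v, 0)]      # X_(j-v), clamped at X_(1)
    return float(np.mean(np.log(n / (2.0 * v) * (upper - lower))))


def entropy_difference_12(sample, v):
    """log(sigma_1) + H(f0) - H_2 for the normal null hypothesis."""
    sigma1 = float(np.std(sample))  # ddof=0, maximum likelihood under normality
    return math.log(sigma1) + H_NORMAL - vasicek_entropy(sample, v)


def monte_carlo_level(n, v, reps=2000, level=0.95, seed=0):
    """Simulated upper quantile of the statistic under the normal null."""
    rng = np.random.default_rng(seed)
    stats = [entropy_difference_12(rng.standard_normal(n), v) for _ in range(reps)]
    return float(np.quantile(stats, level))


if __name__ == "__main__":
    print(monte_carlo_level(20, 3))
```

Because both $\log \hat{\sigma}_1$ and the gap estimate shift by $\log b$ when the data are rescaled by $b$, the statistic is location and scale invariant, so simulating from the standard normal suffices for the composite null hypothesis.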

6. ENTROPY-DIFFERENCE INTERPRETATION OF PARZEN GOODNESS OF FIT PROCEDURE

To test the general hypothesis $H_0$: $X$ is $F_0((x-\mu)/\sigma)$, Parzen (1979) proposes forming raw estimators $\tilde{d}(u)$ of

$$d(u) = \frac{1}{\sigma_0}\, f_0Q_0(u)\, q(u),$$

where $\sigma_0 = \int_0^1 f_0Q_0(t)\, q(t)\, dt$. To form $\tilde{d}(u)$ and $\tilde{\sigma}_0$ we replace $q(u)$ by the least smooth gap estimator $\tilde{q}_2(u)$. Smooth estimators $\hat{d}_m(u)$ of $d(u)$ are formed by the autoregressive method. From estimators of the pseudo-correlations

$$\rho(v) = \int_0^1 e^{2\pi i u v}\, d(u)\, du, \qquad v = 0, \pm 1, \ldots, \pm m,$$

one estimates the coefficients of the autoregressive order $m$ approximator

$$\hat{d}_m(u) = \hat{K}_m \left| 1 + \hat{\alpha}_m(1)\, e^{2\pi i u} + \cdots + \hat{\alpha}_m(m)\, e^{2\pi i u m} \right|^{-2}$$

to $d(u)$. The coefficient $\hat{K}_m$ plays an important role in entropy calculations since

$$\int_0^1 -\log \hat{d}_m(u)\, du = -\log \hat{K}_m$$

can be regarded as an estimator $\hat{\Delta}_{33} = \int_0^1 -\log \hat{d}(u)\, du$ of $\Delta$.

This formula, which we prove below, provides an entropy-difference interpretation of the goodness of fit procedures in Parzen (1979).
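
The role of the autoregressive coefficient $K_m$ can be illustrated with a small numerical sketch. It assumes the approximator has the form $K_m\,|1 + \alpha_m(1) e^{2\pi i u} + \cdots + \alpha_m(m) e^{2\pi i u m}|^{-2}$ and that the coefficients solve Yule-Walker equations $\sum_{j=0}^{m} \alpha_m(j)\,\rho(j-k) = 0$, $k = 1, \ldots, m$, with $\alpha_m(0) = 1$; this is one standard convention and may differ from the one in Parzen (1979). The test density `d_example` and all function names are hypothetical, and the pseudo-correlations are computed from a known $d(u)$ rather than estimated from data.

```python
import numpy as np


def pseudo_correlations(d, m, n_grid=4096):
    """rho(v) = integral over (0,1) of exp(2 pi i u v) d(u) du, v = 0..m (midpoint rule)."""
    u = (np.arange(n_grid) + 0.5) / n_grid
    v = np.arange(m + 1)
    return np.exp(2j * np.pi * np.outer(v, u)) @ d(u) / n_grid


def fit_autoregressive(rho):
    """Solve the assumed Yule-Walker equations; return (alpha_1..alpha_m, K_m)."""
    m = len(rho) - 1

    def r(v):
        # rho(-v) is the complex conjugate of rho(v) because d(u) is real
        return rho[v] if v >= 0 else np.conj(rho[-v])

    a = np.array([[r(j - k) for j in range(1, m + 1)] for k in range(1, m + 1)])
    b = -np.array([r(-k) for k in range(1, m + 1)])
    alpha = np.linalg.solve(a, b)
    k_m = float(np.real(rho[0] + alpha @ rho[1:]))
    return alpha, k_m


def d_m(u, alpha, k_m):
    """Evaluate K_m |1 + sum_j alpha_j exp(2 pi i u j)|^{-2}."""
    j = np.arange(1, len(alpha) + 1)
    g = 1.0 + np.exp(2j * np.pi * np.outer(u, j)) @ alpha
    return k_m / np.abs(g) ** 2


def d_example(u):
    """A positive density on (0,1), chosen only for illustration."""
    return 1.0 + 0.5 * np.cos(2 * np.pi * u) + 0.3 * np.sin(4 * np.pi * u)


rho = pseudo_correlations(d_example, 2)
alpha, k_m = fit_autoregressive(rho)
grid = (np.arange(4096) + 0.5) / 4096
minus_log_integral = float(np.mean(-np.log(d_m(grid, alpha, k_m))))

if __name__ == "__main__":
    print(minus_log_integral, -np.log(k_m))
```

The two printed numbers should agree, which is the identity $\int_0^1 -\log d_m(u)\,du = -\log K_m$; the fitted $d_m$ also reproduces the pseudo-correlations $\rho(0), \ldots, \rho(m)$ of the density it approximates.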

To prove this interpretation of $\hat{\Delta}_{33}$, write

$$-\log d(u) = \log \sigma_0 - \log f_0Q_0(u) - \log q(u).$$
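
Integrating the decomposition $-\log d(u) = \log \sigma_0 - \log f_0Q_0(u) - \log q(u)$ over $0 < u < 1$ appears to complete the argument. The following is a sketch that uses only the quantile forms of entropy from Section 2 and the definition of $H^0(f)$ from Section 3, with $\sigma_0$ playing the role of the scale parameter $\sigma$ (an identification that is exact under $H_0$, where $q(u) = \sigma q_0(u)$ gives $\sigma_0 = \sigma$):

```latex
\begin{align*}
\int_0^1 -\log d(u)\, du
  &= \log \sigma_0 + \int_0^1 -\log f_0Q_0(u)\, du - \int_0^1 \log q(u)\, du \\
  &= \log \sigma_0 + H(f_0) - H(f) \\
  &= H^0(f) - H(f) = \Delta .
\end{align*}
```

Replacing $d(u)$ by the autoregressive estimator $\hat{d}_m(u)$ on the left side then gives $-\log \hat{K}_m$ as an estimator of $\Delta$.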


... the value of $m$ which minimizes AIC($m$) is chosen as an "optimal" value $\hat{m}$.

An optimal parametric-select estimator of the true quantile-density function $q(u)$ is

$$\hat{q}(u) = \tilde{\sigma}_0\, \hat{d}_{\hat{m}}(u)\, \{f_0Q_0(u)\}^{-1}.$$


7.  CONCLUSION 

We believe that the interpretation given in this paper of powerful goodness of fit procedures as entropy-difference statistics provides a striking demonstration of the FUN.STAT synthesis of statistical reasoning.

In addition to elegance of the theory, very practical and implementable procedures are obtained.

The parametric select estimators $\hat{\Delta}_{33,m}$ of entropy-difference test statistics for goodness of fit have for $m = 1$ approximately the properties of fully parametric estimators (such as Shapiro-Wilk $\hat{\Delta}_{11}$) and have for large values of $m$ approximately the properties of fully non-parametric estimators (such as Vasicek $\hat{\Delta}_{12}$). Thus it appears the series $\hat{\Delta}_{33,m}$ provide all the test statistics required. Further, the autoregressive approach provides non-parametric estimators of the true distribution when one rejects the null hypothesis $H_0$.

One may find that a sample passes the goodness of fit procedure for two null hypotheses. An appealing procedure, whose properties remain to be investigated, is to choose that null hypothesis for which $\hat{\Delta}_{33,m}$ is always less than the corresponding statistic for the other hypothesis.

The entropy-difference statistics $\hat{\Delta}_{33,m}$ are implemented in our one-sample univariate data analysis computer program ONESAM. Table 3 lists autoregressive estimates of entropy-difference when testing for normality data sets in Stigler (1977). An asterisk indicates a data set which is not normal in our judgement.

In Table 2 we report significance levels for $\hat{\Delta}_{12}$ obtained (by Monte Carlo calculations) by Dudewicz and van der Meulen (1981) in the case of testing for uniformity rather than normality.

The closeness of the Dudewicz-van der Meulen levels to the Vasicek levels suggests a conjecture, which remains to be proved, that the entropy-difference statistics have distributions which are approximately the same for all null hypotheses $H_0$: $X$ is $F_0((x-\mu)/\sigma)$.

A final noteworthy feature is that the autoregressive method of estimating quantile-density functions and density-quantile functions, introduced in Parzen (1979), can be shown to have a maximum entropy property [compare Parzen (1982)].

Table 2. 5% SIGNIFICANCE LEVELS FOR ENTROPY DIFFERENCE STATISTICS

Accept $H_0$: $X$ is $N(\mu, \sigma^2)$ for some $\mu$ and $\sigma$ if entropy difference is less than threshold given.

Sample   $\hat{\Delta}_{11}$   $\hat{\Delta}_{33,m}$, autoregressive order m      $\hat{\Delta}_{12}$
size n   Shapiro-Wilk          Monte Carlo 5% level                               Vasicek
                               (rough approximation 2m/n)                         (Dudewicz-van der Meulen)
                               m=1      m=2      m=3      m=5
-------  --------------------  -------------------------------------------------  -------------------------
20       .05                   .141     .235     .299     .398                    .40   .40   .43   .61
                               (.10)    (.20)    (.30)    (.50)                   (.43  .43   .47   .66)
50       .023                  .045     [?]      [?]      .176                    .21   .21   .23
                               (.04)    [?]      [?]      (.20)                   (.22  .22   .24)

[?] marks an entry that is illegible in the source; the gap orders $2v$ heading the $\hat{\Delta}_{12}$ columns are also illegible.

Shapiro-Wilk and Vasicek levels are based on Monte Carlo simulation of normal; Dudewicz-van der Meulen levels are based on Monte Carlo simulation of uniform.

One  can  conjecture  a  relation  between  gap  order  2v  and  autoregressive  order 
m  for  the  corresponding  estimators  to  have  similar  distributions  and  therefore 
similar  significance  levels: 

(2v)  m  =  n  =  sample  size 

To  understand  what  this  conjecture  is  alleging  note  that  for  n=20,  m=4  is 
similar  to  2v  =  6;  for  n=50,  m=6  is  similar  to  2v  =  8. 

When  one  uses  gap  estimators  of  q(u),  and  thus  of  entropy,  one  has  the 
problem  of  determining  the  order  2v.  One  can  more  easily  develop  criteria 
for  determining  the  order  m  of  autoregressive  estimators  of  q(u). 

